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Abstract 

We present an exhaustive study of more than 250 ab initio potential 
energy surfaces (PESs) of the model dipeptide HCO-L-Ala-NH 2 . The 
model chemistries (MCs) used are constructed as homo- and heterolevels 
involving possibly different RHF and MP2 calculations for the geometry 
and the energy. The basis sets used belong to a sample of 39 selected 
representants from Pople's split-valence families, ranging from the small 
3-21G to the large 6-311++G(2df,2pd). The reference PES to which the 
rest are compared is the MP2/6-311++G(2df,2pd) homolevel, which, as 
far as we are aware, is the more accurate PES of a dipeptide in the lit- 
erature. The aim of the study presented is twofold: On the one hand, 
the evaluation of the influence of polarization and diffuse functions in 
the basis set, distinguishing between those placed at lst-row atoms and 
those placed at hydrogens, as well as the effect of different contraction 
and valence splitting schemes. On the other hand, the investigation of 
the heterolevel assumption, which is defined here to be that which states 
that heterolevel MCs are more efficient than homolevel MCs. The het- 
erolevel approximation is very commonly used in the literature, but it is 
seldom checked. As far as we know, the only tests for peptides or related 
systems, have been performed using a small number of conformers, and 
this is the first time that this potentially very economical approximation is 
tested in full PESs. In order to achieve these goals, all data sets have been 
compared and analyzed in a way which captures the nearness concept in 
the space of MCs. 
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The most important results of the study are the following: First, that 
the convergence in method is not achieved in the RHF — > MP2 step. Sec- 
ond that the transferability of basis set accuracy from RHF to MP2 is im- 
perfect. These two conclusions lead us to discourage the use of RHF MCs 
for peptides. Regarding the relative efficiency of the Pople's basis sets, we 
recommend the inclusion of polarization functions in lst-row atoms and 
we discourage the use of basis sets containing doubly-split polarization 
shells and no diffuse functions. Also, we have found that 6-31G(d) is very 
efficient for calculating the geometry, and that both the RHF and MP2 
infinite basis set limits are approximately reached at 6-311++G(2df,2pd). 
Finally, related to the heterolevel approximation, we conclude that it is 
essentially correct for the description of the conformational behaviour of 
HCO-L-Ala-NH 2 both at RHF and MP2. Nevertheless, we place a cau- 
tionary remark on the use of RHF geometries with MP2 single-points: 
Whereas this practice could be accurate enough for short peptides, the 
accumulation of errors may render it unreliable for longer chains and re- 
quire the use of MP2 geometries. 

PACS: 07.05. Tp, 31.15.Ar, 31. 50. Be, 87.14.Ee, 87.15. Aa, 89.75.-k 

1 Introduction 

In any bottom-up approach to the still unsolved protein folding problem [1-4] , 
the characterization of the conformational behaviour of short peptides [5-12] 
constitutes an unavoidable first step. If high accuracy of the treatment is sought, 
numerically expensive methods must be used to calculate the physical properties 
of these protein subunits. This is why, the most frequently studied peptides are 
the shortest ones: the dipeptides [13-17], in which a single amino acid residue 
is capped at both the N- and C-termini with neutral peptide groups. Among 
them, the most popular choice has been the alanine dipeptide [9,14,18-33], 
which, being the simplest chiral residue, shares many similarities with most of 
the rest of dipeptides for the minimum computational price. 

Although classical force fields [34-42] are the only feasible choice for simu- 
lating large molecules at present, they have been reported to yield inaccurate 
potential energy surfaces (PESs) for dipeptides [14, 26, 43-46] and short pep- 
tides [9,47]. Therefore, it is not surprising that they are widely recognized as 
being unable to capture the fine details needed to correctly describe the in- 
tricacies of the whole protein folding process [43,48-54]. On the other hand, 
albeit prohibitively demanding in terms of computational resources, ab initio 
quantum mechanical calculations [55-57] are not only regarded as the correct 
physical description that in the long run will be the preferred choice to directly 
tackle proteins (given the exponential growth of computer power) , but they are 
also used in small peptides as the reference against which less accurate methods 
must be compared [9,14,26,43,44,46,58,59] in order to parameterize improved 
generations of additive, classical force fields for polypeptides. 

However, despite the sound theoretical basis, in practical quantum chemistry 
calculations a plethora of approximations must be typically made if one wants 
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to obtain the final results in a reasonable human time. The exact 'recipe' 
that includes all the assumptions and steps needed to calculate the relevant 
observables for any molecular system has been termed model chemistry (MC) 
by John Poplc. In his own words, a MC is an "approximate but well-defined 
general and continuous mathematical procedure of simulation" [60]. 

The two starting approximations to the exact Schrodingcr equation that a 
MC must contain are, first, the truncation of the A^-electron space (in wavefunc- 
tion-based methods) or the choice of the functional (in DFT) and, second, the 
truncation of the one-electron space, via the LCAO scheme (in both cases). The 
extent up to which the first truncation is carried (or the functional chosen in 
the case of DFT) is commonly called the method and it is denoted by acronyms 
such as RHF, MP2, B3LYP, CCSD(T), FCI, etc., whereas the second truncation 
is embodied in the definition of a finite set of atom-centered Gaussian functions 
termed basis set [56,57,61,62], which is also designated by conventional short 
names, such as 6-31+G(d), TZP or cc-pVTZ(-f). If we denote the method by a 
capital M and the basis set by a B, the specification of both is conventionally 
denoted by M/B and called a level of the theory. Typical examples of this are 
RHF/3-21G or MP2/cc-pVDZ [55-57]. 

Such levels of the theory are, by themselves, valid MCs, however, it is very 
common [9,31,63,64] to use different levels to perform, first, a (possibly con- 
strained) geometry optimization and, then, a single-point energy calculation on 
top of the resulting structures. If we denote by Li := Mi/Bi a given level of 
the theory, this 'mixed' calculation is indicated by Le//Lg, where the level Le 
at which the single-point energy calculation is performed is written first [60]. 
Herein, if Le 7^ Lg, we shall call Le//Lg an heterolevel MC; whereas, if 
Le = Lg it will be termed a homolevel one, and it will be typically abbre- 
viated omitting the 'double slash' notation. 

Apart from the approximations described above, which are the most com- 
monly used and the only ones that arc considered in this work, the MC concept 
may include a lot of additional features: protocols for extrapolating to the 
infinite-basis set limit [65-69], additivity assumptions [70-73], extrapolations 
of the M0ller-Plcsset series to infinite order [74] , removal of the so-called basis 
set superposition error (BSSE) [75-81], etc. The reason behind most of these 
techniques is the urging need to reduce the computational cost of the calcula- 
tions. For example, in the case of the heterolevel approximation, this economy 
principle forces the level Le at which the single-point energy calculation is per- 
formed to be more accurate and more numerically demanding than Lg', the 
reason being simply that, while we must compute the energy only once at Le, 
we need to calculate several times the energy and its gradient with respect to 
the unconstrained internal nuclear coordinates at level Lg (the actual number 
of times depending on the starting structure, the algorithms used and the size of 
the system) . Therefore, it would be pointless to use an heterolevel MC L E // Lq 
in which Lg is more expensive than Le, since, at the end of the geometry 
optimization, the energy at level Le is available. 

Now, although general applicability is a requirement that all MCs must 
satisfy, general accuracy is not mandatory. Actually, the fact is that the different 
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procedures that conform a given MC are typically parameterized and tested in 
very particular systems, which are often small molecules. Therefore, the validity 
of the approximations outside that native range of problems must be always 
questioned and checked, but, while the computational cost of a given MC is 
easy to evaluate, its expected accuracy on a particular problem could be difficult 
to predict a priori, specially if we arc dealing with large molecules in which 
interactions in very different energy scales are playing a role. The description 
of the conformational behaviour of peptides (or, more generally, flexible organic 
species), via their PESs in terms of the soft internal coordinates, is one of such 
problems and the one that is treated in this work. 

Our aim here is to provide an exhaustive study of the Restricted Hartree Fock 
(RHF) and M0ller-Plesset 2 (MP2) methods, using the split-valence families of 
basis sets devised by Pople and collaborators [82-89]. To this end, we compare 
the PES of the model dipeptide HCO-L-Ala-NH 2 (sec fig. [TJ calculated with 
a large number of homo- and heterolevel MCs, and assess their efficiency by 
comparison with a reference PES. Special interest is devoted to the evaluation of 
the influence of polarization and diffuse functions in the basis sets, distinguishing 
between those placed at f st-row atoms and those placed at hydrogens, as well 
as the effect of different contraction and valence-splitting schemes. 

The second objective of this study, and probably the main one, is the inves- 
tigation of the heterolevel assumption, which is defined here to be that which 
states that heterolevel MCs are more efficient than homolevel ones. The het- 
erolevel assumption is very commonly assumed in the literature [14, 29, 3f, 32, 
46,59,63,64,90], but it is seldom checked. As far as we know, the only tests 
for peptides or related systems, have been performed using a small number of 
conformers [9,15,17,23,91], and this is the first time that this potentially very 
economical approximation is tested in full PESs. 

In sec. 12. 1[ the methodological details regarding the quantum mechanical 
calculations performed in this work are provided. In sec. 12. 21 a brief summary 
of the meaning and the properties of the distance introduced in ref. 92 and 
used for comparing the different MCs is given for reference. Next, in sec. 12.31 
we discuss the rules and criteria that have been used in order to reasonably 
sample the enormous space of all Pople's basis sets. In sec. [3J the main results 
of the investigation are presented. For convenience, they are organized into four 
different subsections: in sec. 13.11 a RHF//RHF-intramethod study is performed, 
whereas the MP2 analogous is presented in sec. 13.21 In sec. 13.31 a small interlude 
is dedicated to reflect on the general ideas and the nearness concept in the space 
of MCs that underlie an investigation such as this one, and also to compare 
the RHF and MP2 results obtained in the previous two sections. In sec. 13.41 
heterolevel MCs in which the geometry is calculated at RHF and the energy at 
MP 2 are evaluated. Finally, sec. 2] is devoted to give a brief summary of the 
most important conclusions of the work. 
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2 Methods 



2.1 Quantum mechanical calculations and internal coor- 
dinates 

All ab initio quantum mechanical calculations have been performed using the 
Gaussian03 program [94] under Linux and in 3.20 GHz PIV machines with 2 GB 
RAM memory. The internal coordinates used for the Z-matrix of the HCO-L- 
Ala-NH2 dipeptide in the Gaussian03 input files (automatically generated with 
Perl scripts) are the Systematic Approximately Separable Modular Internal Co- 
ordinates (SASMIC) ones introduced in ref. 93. They are presented in tabic [1] 
(see also fig. Q] for reference). For the geometry optimizations, the SASMIC 
scheme has been used too (Opt=Z-matrix option) instead of the default redun- 
dant internal coordinates provided by Gaussian03, since we have seen that, when 
soft coordinates, such as the Ramachandran angles, arc held fixed and mostly 
hard coordinates arc let vary, the use of the SASMIC scheme slightly reduces 
the time to converge with respect to the redundant internals (for unconstrained 
optimizations, on the other hand, the redundant coordinates seem to slightly 
outperform the SASMIC ones). 

All PESs in this study have been discretized into a regular 12 x 12 grid in the 
bidimcnsional space spanned by the Ramachandran angles <j> and ifj, with both 
of them ranging from —165° to 165° in steps of 30°. 

To calculate the geometry at a particular level of the theory, we have run 
constrained energy optimizations at each point of the grid, freezing the two 
Ramachandran angles <f> and ip at the corresponding values, and, in order to 
save computational resources, the starting structures were taken, when possi- 
ble, from PESs previously optimized at a lower level of the theory. The con- 
vergence criterium for RHF optimizations has been set to 0pt=Tight, while, in 
the case of MP2, an intermediate option of I0p( 1/7=100) has been used (note 




Figure 1: Atom numeration of the protected dipeptide HCO-L-Ala-NH2 accord- 
ing to the SASMIC scheme introduced in ref. 93. The soft Ramachandran angles 
4> and ip are also indicated. 
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Atom name Bond length Bond angle Dihedral angle 



Hi 
C 2 
N 3 
4 
C 5 
H (i 
C 7 
C 8 
H 9 



(2,1) 

(3,2) 

(4,2) 

(5,3) 

(6,3) 

(7,5) 

(8,5) 

(9,5) 

(10,8) 

(11,8) 

(12,8) 

(13,7) 

(14,7) 

(15,13) 

(16,13) 



(3,2,1) 

(4,2,1) 

(5,3,2) 

(6,3,2) 

(7,5,3) 

(8,5,3) 

(9,5,3) 

(10,8,5) 

(11,8,5) 

(12,8,5) 

(13,7,5) 

(14,7,5) 

(15,13,7) 

(16,13,7) 



0:^(7,5,3,2) 



(8,5,3,7) 

(9,5,3,7) 

(10,8,5,3) 

(11,8,5,10) 

(12,8,5,10) 



(4,2,1,3) 
(5,3,2,1) 
(6,3,2,5) 



Hio 
Hn 

Hl2 
N13 
Ol4 
H15 
Hi6 



i, :=(13,7,5,3) 



(14,7,5,13) 
(15,13,7,5) 
(16,13,7,15) 



Table 1: Internal coordinates in Z-matrix form of the protected dipeptide HCO-L- 
Ala-NH2 according to the SASMIC scheme introduced in ref. 93. The numbering of 
the atoms is that in fig. [1] and the soft Ramachandran angles (f> and if) are indicated. 

that 0pt=Tight corresponds to I0p(l/7=10), whereas the default criterium is 
I0p(l/7=300)). The resulting geometries have been automatically extracted by 
Perl scripts and used to construct the input files for the heterolevel calculations. 

The self-consistent Hartree-Fock convergence criterium has been set in all 
cases to SCF=(Conver=10) (tighter than SCF=Tight) and the MP2 calculations 
have been performed in the (default) frozen-core approximation. 

At the Hartree-Fock level, 142 PESs have been calculated, taking a total of 
~ 3.7 years of computer time, whereas, at MP2, 35 PESs have been computed 
and the time invested amounts to ~ 4.5 years, from which, the highest level PES 
computed in this study, the MP2/6-311+- |-G(2df,2pd) one depicted in figlH has 
taken ~ 3 years of computer time. Finally, 88 PESs have been calculated with 
MP2//RHF-intermethod MCs, taking ~ 6 months. In total, 265 PESs of the 
model dipeptide HCO-L-Ala-NH2 have been computed for this study, taking 
~ 8.7 years of computer time. All these PESs are available as supplementary 
material, see the Conclusions for access information. 

Finally, let us note that the correcting terms to the PES coming from mass- 
metric tensors determinants have been recently shown to be relevant for the 
conformational behaviour of peptides [33] . Although, in this study, we have not 
included them, the PES calculated here is the greatest part of the effective free 
energy [33], so that it may be considered as the first ingredient for a further 
refinement of the study in which the correcting terms are taken into account. 
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2.2 Physically meaningful distance 



In order to compare the PESs produced by the different (homo- and heterolevel) 
MCs, a statistical criterium (distance) introduced in ref. 92 has been used. Let 
us recall here that this distance, denoted by di2, profits from the complex nature 
of the system studied to compare any two different potential energy functions, 
V\ and V-2- From a working set of conformations (in this case, the 144 points 
of each PES), it statistically measures the typical error that one makes in the 
energy differences if V2 is used instead of the more accurate V\, admitting a 
linear rescaling and a shift in the energy reference. 

This distance, which has energy units, presents better properties than other 
quantities customarily used to perform these comparisons, such as the energy 
RMSD, the average energy error, etc., and it may be related to the Pearson's 
correlation coefficient r\2 by 



where 02 is the standard deviation of V2 in the working set. 

Also, due to its physical meaning, it has been argued in ref. 92 that, if the 
distance between two different approximations of the energy of the same system 
is less than RT, one may safely substitute one by the other without altering 
the relevant dynamical or thermodynamical behaviour. Consequently, we shall 
present the results in units of RT (at 300° K, so that RT ~ 0.6 kcal/mol). 

Finally, if one assumes that the effective energies compared will be used to 
construct a polypeptide potential and that it will be designed as simply the 
sum of mono-residue ones [95], then, the number A res of residues up to which 
one may go keeping the distance d\2 between the two approximations of the the 
A-residue potential below RT is [92] 



Now, according to the value taken by A res for a comparison between a fixed 
reference PES, denoted by V\, and a candidate approximation, denoted by V2, 
we divide all the efficiency plots in sec [3] in three regions depending on the ac- 
curacy: the protein region, corresponding to < d\2 < 0.1RT, or, equivalcntly, 
to 100 < A ros < 00; the peptide region, corresponding to 0.1-RT < di2 < RT, or 
1 < Ar CS < 100; and, finally, the inaccurate region, where d\2 > RT, and even 
for a dipeptide it is not advisable to use V2 as an approximation to V\ . 

2.3 Basis set selection 

In the whole study presented in this work, only Poplc's split-valence basis sets 
[82-89] have been investigated. Among the many reasons behind this choice, 
we would like to mention the following ones: 



dri = V2<J2(l-r 2 12 ) 1/2 



(1) 




(2) 
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• They are very popular and they are implemented in almost every quantum 
chemistry package, in such a way that they are readily available for most 
researchers and the results here may be easily checked or extended. 

• There exist a lot of data calculated using these basis sets in the litera- 
ture, so that the knowledge about their behaviour in different problems is 
constantly growing and may also be enriched by the study presented here. 

• Poplc's split-valence basis sets incorporate, and hence allow to investi- 
gate, most of the features and improvements that are commonly used in 
the literature, such as contraction, valence-splitting, diffuse functions and 
polarizations. 

• Unlike some other popular basis sets, such as Dunning's correlation-consist- 
ent family [96] , in which the diffuse or polarization functions are added in 
preset groups, Poplc's basis sets allow for the addition of individual shells 
rather independently, thus permitting a more in depth study. 

• The number of different basis sets available is very large (see, for example 
the EMSL database at http://www.emsl.pnl.gov/forms/basisform. 
html), so that, for obvious computational reasons, one cannot explore 
them all, and some choice must be made. 

Now, even restricting oneself to this particular family of basis sets, the num- 
ber of variants that can be formed by independently adding each type of diffuse 
or polarization function to each one of the basic 6-31G and 6-311G sets is huge 
(to get to the sought point, there is no need to consider the addition of func- 
tions to 3-21G, 4-31G, etc.). Using that the largest set of diffuse and polarization 
shells that we may add is the '++G(3df,3pd)' one [85], we can express the dif- 
ferent basis sets that may be constructed as a product of all the independent 
possibilities: 



{6-31G,6-311G| x { ■ ,+} *{•>+} x{-,d,2d,3d) 

I- J I J Ist-row ^ J hydrogen I ) Ist-row 

x{.,p,2p,3p| xf-.fj x{.,d} , (3) 

l. J hydrogen ^ ) Ist-row ^ J hydrogen 

where the dot • indicates here that no function is added from a particular 
group. 

Therefore, there are 2x2x2x4x4x2x2 = 512 different Pople's split- 
valence basis sets just considering the 6-31G and 6-311G families. This number 
is prohibitively large to carry out a full study even at the RHF level, so that, 
here, the following strategy has been devised to render the investigation feasible: 

To begin with, we impose several constraints on the basis sets that will be 
considered in a first stage: 

(i) The maximum set of diffuse and polarization shells added is '++G(2df, 
2pd)', instead of '++G(3df,3pd)'. This is consistent with the thumb-rule 
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First-stage, rules-complying basis sets (24) 

3-21G 6-31G 6-311G 

3-21G(d,p) 6-31G(d,p) 6-311G(d,p) 
3-21++G 6-31G(2d,2p) 6-311G(2d,2p) 

3- 21++G(d,p) 6-31G(2df,2pd) 6-311G(2df,2pd) 

4- 31G 6-31++G 6-311++G 
4-31G(d,p) 6-31++G(d,p) 6-311++G(d,p) 
4-31++G 6-31++G(2d,2p) 6-311++G(2d,2p) 
4-31++G(d,p) 6-31++G(2df,2pd) 6-311++G(2df,2pd) 

First violation of the rules (5) 

6-31+G(d,p) 6-31++G(d) 6-3lG(f,d) 
6-31-+G(d,p) 6-31++G(-,p) 

Second violation of the rules (10) 

4-31G(d) 6-311G(d) 6-31G(d) 

4-31+G(d) 6-311+G(d) 6-31+G(d) 
4-31+G(d,p) 6-311+G(d,p) 
4-31++G(d) 6-311++G(d) 



Table 2: Basis sets investigated in this work. They are organized in three groups: the 
first one contains the basis sets that comply with some heuristic restrictions commonly 
found in the literature; in the second group, these restrictions are broken in an ex- 
ploratory manner; finally, in the third group, 10 new basis sets are selected according 
to what has been learned by violating the rules. The number of basis sets in each 
group is shown in brackets, the dot ■ is used to indicate that no shell of a particular 
type is added to the lst-row atoms, and the largest basis set is written in bold face. 
See also fig. 



concept of balance [56], according to which, the most efficient (balanced) 
basis sets are typically those that contain, for each angular momentum I, 
one shell more than the ones included for so that 6-311++G(3df,3pd), 
for example, should be regarded as unbalanced. 

(ii) There must be the same number and type of shells in hydrogens as in 
lst-row atoms. This has to be interpreted in the proper way: for example, 
if we add two d-type polarization shells to lst-row atoms, we must add 
two p-type ones to hydrogens. They are of the same type in the sense that 
they are one momentum angular unit larger than the largest one in the 
respective valence shell. 

(iii) The higher angular momentum f-type (for lst-row atoms) and d-type shells 
(for hydrogens), are not included unless the lower polarizations are doubly 
split, i.e., unless we have already included the (2d,2p)-shclls. This is again 



consistent with the balance rule mentioned in point (i). 

(iv) The investigation of smaller basis sets is restricted to the 3-2 1G and 
4-31G families, and the largest set of extra shells that is added to them is 
'++G(d,p)\ For consistence, the diffuse and polarization functions used 
for 3-21G and 4-31G are the same as the ones for 6-31G and 6-311G 
[84,85,88,89]. 

These rules, most of which arc typically obeyed (often tacitly) in the lit- 
erature [8,11,15,17,23-25,28,29,31-33,71], produce the list of 24 basis sets 
labeled as 'First-stage, rules-complying basis sets' in table [5] and depicted as 
black circles in fig. [H 

Even if their exhaustive study is already a demanding computational task 
and the space of all Pople's split-valence basis sets may be thought to be rea- 
sonably sampled by this 'first-stage' group, we wanted to check the validity of 
some of the rules, since, in the same spirit of the arguments given in the in- 
troduction, what is good for a particular system or a particular purpose is not 
necessarily good for a different one, which may be far away from the native 
playground where the methods have been traditionally tested and parameter- 
ized. Therefore, to this end, we have chosen the medium-sized and reasonably 
RHF-cfficicnt 6-31++G(d,p) basis set (see sec. 13. ip . and we have modified it in 
order to break restrictions (ii) and (iii). On the one hand, as representants of 
breaking rule (ii), we have selected the basis sets 6-31+G(d,p), 6-31++G(d), 
6-31 ■ +G(d,p) and 6-31++G( • ,p), where, in the first two diffuse and a 

polarization shell has been respectively removed from the hydrogens, while, in 
the last two ones, the removal has been carried out on the lst-row atoms. This 
second modification is so unusual (in fact, we have not found any work where it 
is performed) that there is no notation for it in the literature; herein, a dot • is 
used in the place where the unexisting lst-row-atom shell would appear. On the 
other hand, as a representant of breaking rule (iii), we have selected 6-31G(f,d). 
This new group of 5 basis sets is labeled as 'First violation of the rules' in tabled 
and depicted as grey-filled circles in fig. [51 We have decided to violate neither 
rule (i), mainly for computational reasons, nor rule (iv), due to the fact that 
the study of the smaller basis sets is intended to be only exploratory (and, in 
any case, the 3-21G and 4-31G families have proved to be rather inefficient for 
this problem; see sec. [3]). 

The conclusions extracted from the study of the 'first violation of the rules' 
group within RHF are discussed later, however, it suffices to say for the moment 
that we learn from them that breaking rule (iii) is not advantageous, and that 
one may benefit from breaking rule (ii) only if the functions are removed from 
the hydrogens. Therefore, in the final step of the selection of the basis sets 
that shall be investigated, we include a new group of 10 basis sets which come 
from removing hydrogen- atom diffuse and/or polarization shells from some of 
the most efficient ones in the other two groups. This new block is labeled as 
'Second violation of the rules' in tableland depicted as white-filled circles in 

fig. El 
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a 



++G 



++G(d,p) 



3-21 G 



++G 



4-3 1G 



■+G(d,p) 



6-3 1G 




G(d,p) 



G(2d,2p) G(2df,2pd) G(f,d) 



++G 



++G(d,p) ++G(2d,2p) ++G(2df„2pd) 



6-31 1G 




G(d,p) 



G(2d,2p) G(2df,2pd) 



Figure 2: Basis sets investigated in this work. They are organized in three 
groups: the first one, depicted as black circles, contains the basis sets that 
comply with some heuristic restrictions commonly found in the literature; in the 
second group, represented as grey-filled circles, these restrictions are broken in 
an exploratory manner; finally, in the third group, shown as white-filled circles, 
10 new basis sets are selected according to what has been learned by violating 
the rules. In (a), a general view of all the 39 basis sets is presented, while in 
(b), the left-most region of the 6-3 1G family has been enlarged so that the basis 
sets belonging to the second and third groups could be more easily appreciated. 
The dot • is used to indicate that no shell of a particular type is added to the 
lst-row atoms, and the geometric arrangement of the basis sets has no deep 
meaning whatsoever, it is only intended to provide visual comfort. See also 
table El 
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The basis sets used in the RHF part of the study are those in table [3J 
whereas, in the MP 2 part, we have considered the smaller subgroup that may 
be found in tabled] (see also fig. (H). All of them have been taken from the 
Gaussian03 internally stored library except for 6-31 ■ +G(d,p), 6-31++G( • ,p) 
and all the basis sets generated from the 3-2 1G and 4-3 1G ones by adding 
extra functions. The first two have no accepted notation and cannot be spec- 
ified in the program, while the ones derived from 3-21G and 4-31G have been 
constructed, for consistence, using the diffuse and/or polarization shells of the 
6-31G and 6-311G families. For these exceptions, the data has been taken from 
the EMSL repository at http://www.emsl.pnl.gov/forms/basisform.htmlQ 
and the basis sets have been read using the Gen keyword. In all cases, spherical 
Gaussian-type orbitals (GTOs) have been preferred, thus having 5 d-type and 
7 f-type functions per shell. 

3 Results 

3.1 RHF//RHF-intramethod model chemistries 

The study in this work begins by performing an exhaustive comparison of all 
the homolevel MCs and most of the heterolevel ones that can be constructed 
using the 39 different basis sets described above and within the RHF method. 
The original aim was to identify, separately, the most efficient basis sets for do- 
ing geometry optimizations and those that perform best for single-point energy 
calculations, in order to extract the information needed to carry out, in succes- 
sive stages, a (necessarily) more restrictive study of MP2//MP2 and MP2//RHF 
MCs. However, due to the considerations made in sec. 13.31 all mentions to the 
accuracy of any given MC in this section must be understood as relative to 
the RHF //RHF reference, and not to the (surely better) MP2//MP2 one or to 
the exact result. In this spirit, this part of the study should be regarded as 
an evaluation of the most efficient MCs for approximating the infinite basis set 
Hartree-Fock limit (for which the best RHF //RHF homolevel here is probably a 
good approximation), and also as a way of introducing the relevant concepts and 
the systematic approach that shall be used in the rest of the computationally 
more useful sections. 

Having this in mind, the efficiency of a particular MC is laxly defined as 
a balance between accuracy (in terms of the distance introduced in sec. 12. 2[) 
and computational cost (in terms of time) . It is graphically extracted from the 
efficiency plots, where the distance d 12 between any given MC and a reference 

Basis sets were obtained from the Extensible Computational Chemistry Environment 
Basis Set Database at http://www.emsl.pnl.gov/forms/basisform.html, Version 02/25/04, 
as developed and distributed by the Molecular Science Computing Facility, Environmental and 
Molecular Sciences Laboratory which is part of the Pacific Northwest Laboratory, P.O. Box 
999, Richland, Washington 99352, USA, and funded by the U.S. Department of Energy. The 
Pacific Northwest Laboratory is a multi-program laboratory operated by Battellc Memorial 
Institute for the U.S. Department of Energy under contract DE-AC06-76RLO 1830. Contact 
Karen Schuchardt for further information. 
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one is shown in units of RT in the a;-axis, while, in the y-axis, one can find in 
logarithmic scale the average computational time taken for each MC, per point 
of the 12x12 grid defined in the Ramachandran space of the model dipcptidc 
HCO-L-Ala-NH2 (see the following pages for several examples). As a general 
thumb-rule, we shall consider a MC to be more efficient for approximating the 
reference when it is placed closer to the origin of coordinates in the efficiency 
plot. This approach is intentionally non-rigorous due to the fact that many 
factors exist that influence the computer time but may vary from one practical 
calculation to another; such as the algorithms used, the actual details of the 
computers (frequency of the processor, size of the RAM and cache memories, 
system bus and disk access velocity, operating system, mathematical libraries, 
etc.), the starting guesses for the SCF orbitals or the starting structures in 
geometry optimizations. Taking this into account, the only conclusions that 
shall be drawn in this work about the relative efficiency of the MCs studied are 
those deduced from strong signals in the plots and, therefore, those that can 
be extrapolated to future calculations; in other words, the small details shall be 
typically neglected. 

The efficiency plots that we will discuss in this section are the ones used 
to compare RHF / / RHF-intramethod homo- and heterolevel MCs with the RHF 
reference, defined as the homolevel MC with the largest basis set, i.e., RHF/ 
6-311++G(2df,2pd) (since, in this section, there is no possible ambiguity, the 
levels shall be denoted in what follows omitting the 'RHF' keyword and speci- 
fying only the basis set). The plots corresponding to this first intramcthod part 
comprise figures from [3] to [6] 

In fig.El the homolevel MCs corresponding to all the basis sets in table [2] are 
compared to the reference one. In fig.[3K, a general picture is presented, whereas, 
in fig. a detailed zoom of the most efficient region of the plot is shown. It 
takes an average of ~ 30 hours per grid point to calculate the PES of the model 
dipeptide HCO-L-Ala-NH 2 at the reference homolevel 6-311++G(2df,2pd) (the 
time per point for homolevels is calculated assuming that all geometry optimiza- 
tions take 20 iterations to converge, this is done in order to avoid the ambiguity 
due to the choice of the starting structures and it allows to place all MCs on 
a more equivalent footing); this time is denoted by ibest and the most efficient 
region is defined as that in which d\2 < RT and t < 10% of ibest- Additionally, 
we indicate in the plots the peptide region (0.1RT < di2 < RT), containing 
the MCs that may correctly approximate the reference one for chains of 1-100 
residues, and the protein region (0 < d\2 < 0.1i?T), including the MCs that are 
accurate for polypeptides over 100 residues (see sec. 

From these two plots, several conclusions may be drawn: 

• Regarding the check of rules (ii) and (iii) via the basis sets in the sec- 
ond group in table [H we see that 6-31+G(d,p) is more efficient than 6- 
31++G(d,p) (it is cheaper and, despite being smaller, more accurate^), 

2 Note that the Hartree-Fock method has a variational origin, in such a way that, if we 
were interested in the absolute value of the energy, and not in the energy differences, an 
enlargement of the basis set would always improve the results. 
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Figure 3: Efficiency plots of the RHF-homolevel MCs corresponding to the ba- 
sis sets in table [21 In the x-axis, we show the distance di2, in units of RT at 
300° K, between any given MC and the reference one (indicated by an encir- 
cled point), while, in the y-eods, we present in logarithmic scale the average 
computational time taken for each MC, per point of the 12x12 grid defined in 
the Ramachandran space of the model dipeptide HCO-L-Ala-NH2. (a) General 
view containing all basis sets, (b) Detailed zoom of the most efficient region of 
the plot (di2 < RT and t < 10% of t hcst ). 
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that 6-31++G(d) is as efficient as the most efficient basis sets of the 
rules-complying group (being outperformed only by some of the ones in 
the third group in tabled]), that 6-31 • +G(d,p) has drifted a little to- 
wards the inefficiency region and that that 6-31++G( ■ ,p) is well deep 
in it. This suggests that it may be profitable to break rule (ii) but only 
in the direction of removing shells from the hydrogens, and not from the 
Ist-row atoms; in agreement with the common practice in the litera- 
ture [6,16,20,22,27,29,31,32,93,97] based on the intuition that 'hydrogens 
are typically more passive atoms sitting at the end of bonds' [56]. On the 
other hand, 6-31G(f,d) turns out to be very inefficient, being about as ac- 
curate as the simple 6-31G basis set but far more expensive. This confirms 
that it is a good idea to follow restriction (iii), which is consistent with 
the already mentioned thumb-rule of basis set 'balance' [56]. 

The whole 3-21G family of basis sets is very inefficient. Only 3-21++G(d,p) 
lies in the accurate region and, anyway, it is less efficient than most of the 
other basis sets there. 

Contrarily, the 4-3 1G family results are quite parallel to and only slightly 
worse than those of the 6-31G family, suggesting that, to account for 
conformational energy differences within the RHF method, the contraction 
of valence orbitals is more important than the contraction of core ones if 
homolcvcl MCs are used. 

In fig. we can notice the existence of a gap in the values of the dis- 
tance d\2, which lies around d\2 = RT and separates the MCs in two 
groups. Notably, all the basis sets in the most accurate group share a 
common characteristic: they contain Ist-row atoms polarization functions, 
whereas those in the inaccurate group do not, with the only exceptions of 
3-21G(d,p) and 6-31G(f,d), whose bad quality has been explained in the 
previous points for other reasons. 

All the basis sets with extra polarizations, (2d,2p) or (2df,2pd), and no 
diffuse functions are less efficient than their diffuse functions- containing 
counterparts. 

Out of some of the specially inefficient cases discussed in the preceding 
points, the addition of diffuse functions to singly-polarized ((d) or (d,p)) 
basis sets always increases the accuracy. 

The only basis set whose homolevel MC lies in the protein region is the 
expensive 6-3H — hG(2df,2pd). 

If we look at the most efficient basis sets (those that lie at the lower-left 
envelope of the 'cloud' of points), we can see that no accumulation point 
is reached, i.e., that, although the distance between 6-311++G(2df,2pd) 
and 6-31+- |-G(2df,2pd) is small enough to suggest that we are close to the 
Hartree-Fock limit for this particular problem, if the basis set is intelli- 
gently enlarged, we obtain increasingly better MCs. 
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Figure 4: Efficiency plots of the RHF-heterolevel MCs L^ st //L l G obtained com- 
puting the geometries with all the basis sets in table [2] but the largest one and 
then performing a single-point energy calculation at the best level of the the- 
ory, L hcst :=6-311++G(2df,2pd), on top of each one of them. In the x-axis, 
we show the distance di2, in units of RT at 300° K, between any given MC 
and the reference one (the homolevel 6-311++G(2df,2pd), indicated by an en- 
circled point), while, in the y-axis, we present in logarithmic scale the average 
computational time taken for each MC, per point of the 12x12 grid defined in 
the Ramachandran space of the model dipeptide HCO-L-Ala-NH2. (a) General 
view containing all basis sets, (b) Detailed zoom of the most efficient region of 
the plot (d 12 < RT and t < 20% of f best ). 
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• For less than 10% the cost of the reference calculation, some particularly 
efficient basis sets for RHF-homolevel MCs that can be used without al- 
tering the relevant conformational behaviour of short peptides (i.e., whose 
distance d 12 with 6-311++G(2df,2pd) is less than RT) arc 6-31+G(d,p), 
6-31+G(d), 4-31+G(d) and 6-31G(d). 

Next, in fig. 5J the reference homolevel 6-311++G(2df,2pd) is compared 
to the RHF// RHF-intramethod-heterolevel MCs L^r st //L G obtained computing 
the geometries with the 38 remaining basis sets in table [5] and then perform- 
ing a single-point energy calculation at the best level of the theory, L hcst := 
6-311++G(2df,2pd), on top of each one of the structures. The aim of this com- 
parison is twofold: on the one hand, we want to measure the relative efficiency 
of the different basis sets for calculating the geometry (not the energy), on the 
other hand, we want to find out whether or not the heterolevel approximation 
described in the introduction is an efficient one within RHF. 

Like in the previous case, in fig. [4^,, a general picture is presented, whereas, 
in fig. UJd, a detailed zoom of the most efficient region of the plot is shown. The 
average time per point t of the heterolevel MCs has been calculated adding the 
average cost of performing a single-point at L best :=6-311+-|-G(2df,2pd) (~ 1.7 
hours) to the average time per point needed to calculate the geometry at each 
one of the levels L G (see page [13]). This ~ 1.7 hours 'offset' in all the times, has 
rendered advisable to set the limit used to define the efficient region in this case 
to the 20% of tbest (instead of the former 10%), so that most of the relevant 
basis sets are included in the second plot in fig. [4Jd. 

In this second part of the present section, several interesting conclusions may 
be extracted from the plots: 

• Related to the test of rules (ii) and (iii) , similar remarks to the ones before 
can be made, the only difference being that, in this case, for computing 
the geometry, 6-31 • +G(d,p) is not as bad as for the homolevel calculation. 
The signal, however, is rather weak and the main conclusions stated in the 
first point above should not be modified. 

• Regarding the 3-21G family of basis sets, we see here that, differently 
from what happened for the homolcvcls, they are not so bad to account 
for the geometry, and, in the case of 3-21G, 3-21++G and 3-21++G(d,p), 
their efficiency is close to that of the corresponding 4-31G and 6-31G 
counterparts. 

• Moreover, the 4-31G basis sets performance is still quite close to that 
of the 6-31G family. This point, together with the previous one, and 
differently from what happened in the case of homolevel MCs, suggests 
that, to account for the equilibrium geometry within RHF, the contraction 
scheme is only mildly important both for valence and core orbitals. 

• In fig. [4Ji, we see again, a gap in the values of the distance di2 separating 
the MCs with the geometry calculated using basis sets that contain lst-row 
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atoms polarization functions from those that do not. The only differences 
are that, this time, the gap is even more evident, it lies around d\2 = 
Q.2RT, and 3-21G(d,p) is placed below it. 

The signal noticed in the homolevel case regarding the relative inefficiency 
of the the basis sets with extra polarizations, (2d,2p) or (2df,2pd), and 
no diffuse functions has become stronger here and a second gap can be 
seen separating them from their diffuse functions-containing counterparts 
and also from the basis sets with only one polarization shell. This gap 
separates, for example, the MCs whose geometries have been calculated 
with 6-31G(2df,2pd) and 6-31G(d,p), in such a way that the smaller one 
is not only more efficient, but also more accurate. This clearly illustrates 
one of the points raised in this study, namely, that MCs parameterized 
and tested in concrete systems may behave in an unexpected way when 
used in a different problem, and that the investigation of the quality of the 
most popular MCs, as well as the design of new ones, for the study of the 
conformational preferences of peptides, is a necessary (albeit enormous) 
task. 

Also for geometry optimizations, the addition of diffuse functions to singly- 
polarized ((d) or (d,p)) basis sets increases the accuracy. 

Contrarily to the situation for homolevels, where the only basis set that 
lied in the protein region was the 6-31++G(2df,2pd) one and some MCs 
presented distances of near 3RT with the reference one, here, all MCs lie 
well below d\2 = RT, and those for which the geometry has been computed 
with a basis set that contains lst-row atoms polarization functions (except 
for 6-31G(f,d)) arc all in the protein region, so that, under the assumptions 
in sec. 12.21 they can correctly approximate the reference MC for chains of 
more than 100 residues. Remarkably, some of this heterolevel MCs, such 
as 6-311++G(2df,2pd)//6-31+G(d) for example, are physically equivalent 
to the reference homolevel up to peptides of ten thousand residues at less 
of 10% the computational cost. Indeed, all these results confirm the het- 
erolevel assumption, discussed in the introduction and so commonly used 
in the literature [14,29,31,32,46,59,63,64,90], for RHF//RHF-intramethod 
MCs. 

Differently from the homolevel case, an accumulation point is reached here 
in the basis sets, since, in fig. we can see that there is no noticeable 
increase in the accuracy, say, beyond 6-311+G(d). 

Finally, let us mention 6-311+G(d), 6-31+G(d) and 6-31G(d) as some ex- 
amples of particularly efficient basis sets for calculating the geometry in 
RHF-heterolevel MCs. Under the approximation in sec. 12.21 they can be 
used without altering the relevant conformational behaviour of polypep- 
tides of more than a hundred residues (i.e., their distance d±2 with the 
homolevel 6-311++G(2df,2pd) is less than 0.1RT), and their computa- 
tional cost is less than 20% that of the reference calculation. 
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Figure 5: Efficiency plots of the RHF-heterolevel MCs L l E //L^? st obtained com- 
puting the geometry at the best level of the theory, L hcst :=6-311++G(2df,2pd), 
and then performing a single-point calculation with all the basis sets in ta- 
ble [2] but the largest one. In the x-axis, we show the distance di2, in units 
of RT at 300° K, between any given MC and the reference one (the homolevel 
6-311++G(2df,2pd), indicated by an encircled point), while, in the y-axis, we 
present in logarithmic scale the average computational time taken for the corre- 
sponding single-point, per point of the 12x12 grid defined in the Ramachandran 
space of the model dipeptide HCO-L-Ala-NH2. (a) General view containing all 
basis sets, (b) Detailed zoom of the most efficient region of the plot (d\2 < RT 
and t < 10% of tbest)- 



19 



Now, after the geometry, we shall investigate the efficiency for performing 
energy calculations within RHF of the all the basis sets in table [2] but the 
largest one. To render the study meaningful, the geometry on top of which 
the single-points are computed must be the same, and we have chosen it to be 
the one calculated at the level L best :=6-311++G(2df,2pd). Of course, since 
the reference to which the L E //L ] Q St heterolevel MCs must be compared is the 
L bcst homolevcl, and they take more computational time than this MC (the 
time ibest plus the one required to perform the single-point at L E ), all of them 
are computationally inefficient a priori. Therefore, in the efficiency plots in 
fig. O the time shown in the y-axis is not the one needed to calculate the actual 
PES with the L E //L^? st MC, but just the one required for the single-point 
computation. In principle therefore, the study and the conclusions drawn should 
be regarded only as providing hints about how efficient a given basis set will 
be if it is used to calculate the energy on top of some less demanding geometry 
than the L best one (in order to have a MC that could have some possibility 
of being efficient). However, in the fourth part of the RHF//RHF-intramcthod 
investigation (see below), we show that the performance of the different basis 
sets for single-point calculations depends weakly on the underlying geometry, 
so that the range of validity of the present part of study must be thought to be 
wider. 

In fig. [5^, a general picture of the comparison is presented, whereas, in 
fig. [5b, a detailed zoom of the most efficient region of the plot is shown. As 
we have already mentioned, the time t shown is the average one per point 
required to perform the single-point energy calculation on the best geometry, 
and, consequently, the time ibest used for defining the efficient region has been 
redefined as the one needed for a single-point at £ bcst (i.e., f bcst ~ 1.7 hours). 

We extract the following conclusions from the plots: 

• Regarding the check of rules (ii) and (iii), the situation is the same as 
in the two former cases, with the only difference that we can see that, 
for single-point calculations, 6-31G(f,d) is much more inefficient than for 
geometry optimizations, being of an accuracy close to that of the smallest 
basis set studied, the 3-21G, and taking considerably more time. 

• The 3-2 1G family of basis sets is very inefficient for energy calculations. 

• On the other hand, like it happened in the homolevcls case, the 4-31G basis 
sets performance is quite close to that of the 6-31G family. This suggests 
that, for energy calculations in RHF//RHF MCs, to use a considerable 
number of primitive Gaussian shells to form the contracted ones is more 
important in the valence orbitals than in the core ones. 

• The Ist-row atoms polarization gap in the distance d\i also occurs for 
single-point calculations (see fig. [5^,). This time, 3-21G(d,p) is placed 
above it. 

• The relative inefficiency of the the basis sets with extra polarizations, 
(2d,2p) or (2df,2pd), and no diffuse functions is also observed here for 
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energy calculations. It is mild, like in the homolevels case, and no gap 
appears. 

• Like in the two studies above, the addition of diffuse functions to singly- 
polarized ((d) or (d,p)) basis sets increases the accuracy for single-point 
energy calculations as well. 

• About the accuracy of the investigated MCs, the situation observed in the 
homolevel case is even more severe here, since not even the 6-31++G(2df,2pd) 
single-point MC lies in the protein region and the worst basis sets (see 
3-21G, for example) present distances over 3RT. This enriches and sup- 
ports the ideas that underlie the heterolevel assumption, showing that, 
whereas the level of the theory may be lowered in the calculation of the 
(constrained) equilibrium geometries, it is necessary to perform high-level 
energy single-points if a good accuracy is sought. 

• Related to the basis set convergence issue, the situation here is analogous 
to the one seen in the case of homolevel MCs: No accumulation point is 
reached, and the accuracy can always be increased by intelligently enlarg- 
ing the basis set. 

• Finally, let us mention 6-311+G(d,p), 6-31+G(d) and 6-31G(d) as some 
examples of particularly efficient basis sets also for calculating the energy 
in RHF- heterolevel MCs. Under the approximation in sec. 12.21 they can be 
used without altering the relevant conformational behaviour of short pep- 
tides, and their computational cost is less than 10% that of the reference 
single-point calculation. 

Next, in order to close the RHF//RHF-intramethod section, we evaluate a 
group of heterolevel MCs which are constructed by simultaneously decreasing 
the level of the theory used for the geometry and the one used for the energy 
single-point, relatively to the reference 6-311++G(2df,2pd). 

Using the basis sets in tabled there exist 38 x (38 — 1) = 1406 different 
MCs of the form L' E //L G , with L% ^ L G and excluding 6-311++G(2df,2pd). 
This number is too large to perform an exhaustive study and, therefore, any 
investigation of the MCs in this particular group must be necessarily exploratory. 
Here, wc arc specially interested in the most efficient MCs, so that, using the 
lessons learned in the preceding paragraphs, we have considered only hctcrolcvcls 
with L G being 6-31G(d), 6-31+G(d) or 6-311+G(d), which we have proved to 
perform well at least when the single-point is calculated at 6-311++G(2df,2pd). 
For the choice of L E , different criteria have been followed. On the one hand, 
since the energy at level L G is readily available as an output of the geometry 
optimization step, it is clear that to perform a single-point calculation with a 
level of similar accuracy to L l G will not pay. On the other hand, some hints may 
be extracted from the study in fig. [5] about which could be the most efficient 
basis sets for calculating the energy. Taking these two points into consideration, 
and also including, for checking purposes, some levels that are expected to 
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Figure 6: (a) Efficiency plot of some selected RHF-heterolevel MCs L % E ljL l G 
with L l E Lq and both of them different from the best level 6- 

311 ++G(2df,2pd). The MCs calculated on top of the same geometry are joined 
by broken lines, (b) Efficiency plot of all the MCs in figs. [3J [4] and in the 
(a)-part of this figure. In both figures, in the x-axis, we show the distance di2, 
in units of RT at 300° K, between any given MC and the reference one (the 
homolevel 6-311++G(2df,2pd), indicated by an encircled point in (b)), while, 
in the y-axis, we present in logarithmic scale the average computational time 
per point of the 12x12 grid defined in the Ramachandran space of the model 
dipeptide HCO-L-Ala-NH2. The different accuracy regions depending on di2 7 
are labeled, and the 10% of the time ibost taken by the reference homolevel 
6-311++G(2df,2pd) is also indicated. 



22 



Efficient RHF//RHF MCs 


d 12 /RT a 


AT b 
iv rcs 


t c 


6-311++G(2df,2pd) //6-31++G(d,p) 


0.008 


17382.5 


11.74% 


6-311++G(2df,2pd) //6-31+G(d) 


0.009 


11752.4 


7.53% 


6-311++G 2df,2pd)//4-31+G d) 

i \ ' L /II \ J 


0.014 


5163.4 


7.38% 


6-311++G(2df,2pd)//6-31G(d) 


0.031 


1066.0 


6.62% 


6-31++G(2df,2pd) //6-31G(d) 


0.106 


89.3 


4.86% 


6-31++G(2d,2p) //6-31G(d) 


0.213 


22.0 


1.92% 


6-311++G(d,p)//6-31G(d) 


0.275 


13.2 


1.60% 


6-31++G(d,p)//6-31G(d) 


0.318 


9.9 


1.29% 


4-31++G(d,p)//6-31G(d) 


0.368 


7.4 


1.27% 


6-31G(d)//6-31G(d) 


0.683 


2.1 


0.95% 


6-31G//6-31G 


1.634 


0.4 


0.33% 


3-21G//3-21G 


2.908 


0.1 


0.30% 



Table 3: List of the most efficient RHF//RHF-intramethod MCs located at the lower- 
left envelope of the cloud of points in fig. [6b. The first block contains MCs of the form 
Le'^Z/Lq (see fig. [4j , the second one those of the form L^Z/Lq (see fig. and the 
third one the homolevels in fig. [3] "Distance with the reference MC (the homolevel 
6-311++G(2df,2pd)), in units of RT at 300° K. 'Maximum number of residues in a 
polypeptide potential up to which the corresponding MC may correctly approximate 
the reference (under the assumptions in sec. l2.2p . c Required computer time, expressed 
as a fraction of tbest- 

be inefficient, the basis sets that are investigated for performing single-points 
within RHF are those shown in fig. [6^i, where tbcst is again the time taken by 
the reference homolevel 6-311++G(2df,2pd). 

There are no essentially new conclusions to extract from this part of the 
study, since it mainly confirms those drawn from the previous parts and shows 
that they can be combined rather independently. For example, the approximate 
verticality of the dotted lines joining the MCs with equal L E indicates, as we 
have already mentioned, that, in the RHF//RHF case, the accuracy of a given 
MC depends much more strongly on the level used for calculating the energy 
than on the one used for the geometry. Also, the fact that the MCs with L l G = 
6-31G(d) lie in the lower-left envelope of the plot shows that 6-31G(d) keeps 
its character of efficient basis set for computing the geometry even if the single- 
point is calculated with levels that are different from the reference one. Finally, 
note that, in this particular problem, within RHF, and under the assumptions 
in sec. 12. 2\ if one wants to correctly approximate the reference MC beyond 
100-residue peptides, the energy must be calculated at 6-31++G(2df,2pd). 

In fig. [5b, all the 110 MCs studied up to now are depicted as a summary 
(the 38 inefficient L E //L^ st ones are not shown). Now, if we look at the lower- 
left envelope of the plot, we can see that, depending on the target accuracy 
sought, the most efficient MCs may belong to different groups among the ones 



23 



investigated above. From ~ ORT to ~ 0.1i?T, for example, the most efficient 
MCs are the L h E cst //L l G ones; from - 0.1 RT to - 0.5RT, on the other hand, the 
MCs of the form L E //L G , where the single-point level has also been lowered 
with respect to the reference one, clearly outperform those in the rest of groups; 
finally, for distances dn > 0.5RT, it is recommcndable to use homolevcl MCs. 
In table [31 these efficient MCs are shown together with their distance d\% to 
the reference homolevcl 6-311++G(2df,2pd), the number of residues N Ies up 
to which they can be used as a good approximation of it, and the required 
computer time t, expressed as a fraction of ibest- 

3.2 MP2//MP2-intramethod model chemistries 

Now, using all the information gathered in the previous RHF//RHF-intramcthod 
section (see however sec. 13.31 and the first paragraph of sec. 13. ip . we open the 
second part of the study, in which we shall perform an MP2// MP2-intramethod 
investigation with some selected basis sets among those in table [2j The choice 
of MP2 [98] as the method immediately 'above' RHF is justified by several 
reasons. In the first place, it is typically regarded as accurate and as the reason- 
able starting point to include correlation in the literature [7,64,99-102], where 
it is also commonly used as a reference calculation to evaluate or parameter- 
ize less demanding methods [9,63,103-105]. Secondly, and contrarily to DFT, 
MP2 is a wavefunction-based method that allows to more or less systemati- 
cally improve the calculations by going to higher orders of the M0ller-Plesset 
perturbation expansion. The majority of the rest of methods devised to add 
correlation to the RHF wavefunction-based results, such as coupled cluster, con- 
figuration interaction, or MCSCF, are more computationally demanding than 
MP2 [56,106,107]. Finally, although, for some particular problems, DFT may 
rival MP2 [102,108-110], the latter is known to account better for weak disper- 
sion forces, which are present and may be important in peptides [14,64, 111]. 

The basis sets investigated in this MP2 part are the 11 ones in table |H and 
they have been originally chosen in order to adequately sample the larger set 
studied at RHF and check if the same effects are observed at MP2. Some kind of 
selection must be done due to the higher computational cost of MP2 calculations, 
so that, with the hope that the RHF results were relatively transferable to MP2, 
the basis sets that have proved to be relatively more efficient at RHF were 
included in tabled) together with the largest one, the smallest one and a small 
number of other basis sets (such as 6-31G(d,p) or 6-31G(2d,2p), for example) 



3-21G 6-31G(d) 6-31+G(d) 6-311+G(d) 

6-31G 6-31G(d,p) 6-31++G(d,p) 6-311++G(2df,2pd) 

6-31++G 6-31G(2d,2p) 6-31++G(2d,2p) 



Table 4: Basis sets investigated in the MP2// MP2-intramethod part of the study. The 
largest one is indicated in bold face. 
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Figure 7: Potential energy surface of the model dipeptide HCO-L-Ala-NH2 com- 
puted at the MP2/6-311++G(2df,2pd) level of the theory. The PES has been 
originally calculated in a 12x12 discrete grid in the space spanned by the Ra- 
machandran angles <j) and ip and later smoothed with bicubic splines for visual 
convenience. The energy reference has been set to zero. (At this level of the 
theory, the absolute energy of the minimum point in the 12x12 grid, located at 
(-75°, 75°), is -416.4705201527 hartree). 



intended to analyze the tendencies observed in the previous section. In the 
following discussion and in sec. 13.31 however, the RHF — > MP2 transferability 
of the results is shown to be imperfect, so that, despite the valuable lessons 
learned in this work, in further studies, one of the research directions that will 
have to be followed is the addition of more basis sets to table [H 

We would also like to stress that the MP2-reference PES of HCO-L-Ala-NH 2 , 
using 6-311++G(2df,2pd), that has been calculated to carry out the investiga- 
tion presented here (see fig. [7]) is, as far as we are aware, the one computed 
at the highest level of the theory at present. Although coupled cluster [14, 31] 
and MP4 [30] methods have been used to perform single-points on top of the 
geometries optimized at lower levels for some selected conformer^l, the highest 
homolevels used to calculate full PESs in the literature after the one used in 
this study seem to be MP2/6-311G(d,p) in ref. 28 and B3LYP/6-311++G(d,p) 
in ref. 31 (assuming that the accuracy of the B3LYP method lies somewhere 
between RHF and MP2). Regarding heterolevel MCs, in ref. 43, a PES is cal- 
culated at the LMP2/cc-pVQZ(-g)//MP2/6-31G(d) level, and although this is 
certainly a remarkably accurate computation, the question whether it is bet- 
ter than the homolevel MP2/6-311++G(2df,2pd) will have to wait until an 

3 In this brief review of the literature, we include, apart from the calculations in HCO-L- 
Ala-NH2, also those in alanine dipeptides with different protecting groups, since both efficiency 
and accuracy considerations are expected to be very similar. 
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Figure 8: Efficiency plot of the MP2-homolevel MCs corresponding to all the 
basis sets in table 0J In the x-axis, we show the distance dn in units of RT, 
at 300° K, between any given MC and the reference one (the homolevel 6- 
311++G(2df,2pd), indicated by an encircled point), while, in the y-axis, we 
present in logarithmic scale the average computational time taken for each MC, 
per point of the 12x12 grid defined in the Ramachandran space of the model 
dipeptide HCO-L-Ala-NH 2 . 



assessment of the LMP2 method and its relation to the heterolevel hypothesis 
is performed. Finally, something analogous may be said about the MP2/cc- 
pVTZ//MP2/6-31G(d) PES in ref. 32. 

Now, the structure of the MP2//MP2-intramethod study is the same as in 
the RHF//RHF case: We begin by evaluating the MP2 homolevels, and, just as 
we did before, the 'MP2' keyword is omitted from the MCs specification, since, 
in this section, no possible ambiguity may appear. 

In fig. the homolevel MCs corresponding to all the basis sets in table [4] 
are compared to the reference one. It takes an average of ~ 200 hours ~ 8 days 
of computer time per grid point (see page [13]) to calculate the PES of the model 
dipeptide HCO-L-Ala-NH 2 at the reference homolevel 6-311++G(2df,2pd); this 
time is denoted by ibest- 

Regarding the conclusions that can be extracted from this plot, let us focus 
(remarking the differences) on the issues parallel to the ones studied in the RHF 
case, although, since the number of basis sets in table 0] is smaller than that in 
tabled some details will have to be left out: 

• The 3-21G basis set is again the worst one for homolevel calculations, with 
a distance close to 3 RT. 

• The lst-row atoms polarization gap that we saw in fig. [3^i, is absent here, 
and, for example, the 6-31++G basis set is more accurate than the larger 
and polarized 6-311+G(d). 
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• The only basis set with extra polarizations and no diffuse functions that 
we have studied in the MP2 case, 6-31G(2d,2p) ; is less efficient than its 
diffuse functions- containing counterpart, 6-31H — hG(2d,2p). 

• Whereas, in the RHF case, the addition of diffuse functions to singly- 
polarized ((d) or (d,p)) basis sets always increased the accuracy, here, it 
is sometimes slightly advantageous (in the 6-31G(d,p) — ► 6-31+- |-G(d,p) 
case) and sometimes slightly disadvantageous (in the 6-31G(d) — > 6-31+G(d) 
case). So that no clear conclusion may be drawn to this respect. 

• There is no basis set whose homolevel MC lies in the protein region, al- 
though we remark that the second largest basis set studied with RHF, 
6-31++G(2df,2pd), which lied in the protein region then, has not been 
included in this MP2 part of the work. 

• If we look at the most efficient basis sets (those that lie at the lower- 
left envelope of the 'cloud' of points), we can see that, like in RHF, no 
accumulation point is reached, i.e., that, although the distance between 
6-311++G(2df,2pd) and 6-31++G(2d,2p) is small enough to suggest that 
we are close to the MP2 limit for this particular problem, if the basis set 
is intelligently enlarged, we obtain increasingly better MCs. Also note 
that, if we compare fig. [8] here to fig. [3] in the previous section, we do 
not observe a strong signal indicating the slower basis set convergence 
of the MP2 method that is sometimes reported in the literature [56, 112, 
113]. Therefore, from these limited data, we must conclude that, for 
conformational energy differences in peptides, the homolevel MCs converge 
approximately at the same pace towards the infinite basis set limit for RHF 
and MP2. 

• For less than 10% the cost of the reference calculation, some particularly 
efficient basis sets for MP2-homolevel MCs that can be used without al- 
tering the relevant conformational behaviour of short peptides (i.e., whose 
distance d 12 with 6-311++G(2df,2pd) is less than RT) arc 6-31++G(d,p), 
6-31G(d,p) and 6-31G(d). 

Next, in fig. [9l the reference homolevel 6-311++G(2df,2pd) is compared 
to the MP2/ / MP2-intramethod-heterolevel MCs L^f/L^ obtained computing 
the geometries with the 10 remaining basis sets in tabled] and then performing 
a single-point energy calculation at the best level of the theory, L hest —6- 
311++G(2df,2pd), on top of each one of the structures. Like in the RHF case, 
the aim of this comparison is twofold: on the one hand, we want to measure the 
relative efficiency of the different basis sets for calculating the geometry (not the 
energy), on the other hand, we want to find out whether or not the heterolevel 
assumption described in the introduction is a good approximation within MP2. 

The average time per point t of the heterolevel MCs has been calculated 
adding the average cost of performing a single-point at L hcst := 6-311++G 
(2df,2pd) (~ 2.7 hours) to the average time per point needed to calculate the 
geometry at each one of the levels L l G (see page [T3|) . 
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Figure 9: Efficiency plot of the MP2-heterolevel MCs L h £ st //L l G obtained com- 
puting the geometries with all the basis sets in table 0] but the largest one and 
then performing a single-point energy calculation at the best level of the theory, 
L bost :=6-311++G(2df,2pd), on top of each one of them. In the x-axis, we 
show the distance d±2, m units of RT at 300° K, between any given MC and 
the reference one (the homolevel 6-311++G(2df,2pd), indicated by an encircled 
point), while, in the y-axis, we present in logarithmic scale the average com- 
putational time taken for each MC, per point of the 12x12 grid defined in the 
Ramachandran space of the model dipeptide HCO-L-Ala-NH2. 



The following remarks may be made about fig. O 

• Although the only representant of the 3-21G family of basis sets in this 
MP2//MP2-intramethod study is one of the most inaccurate levels for 
calculating the geometry, the signal observed in the RHF case, indicating 
that the 3-2 1G basis sets are not so bad to account for the geometry, also 
occurs here, where we can see that 3-2 1G is more accurate (and hence 
more efficient) than the larger 6-31G and 6-31++G. 

• Contrarily to the homolevel case, here we can appreciate, like we did in 
RHF, a rather wide gap in the values of the distance d\i separating the 
MCs with the geometry calculated using basis sets that contain lst-row 
atoms polarization functions from those that do not. 

• The signal noticed in the homolevel case regarding the relative inefficiency 
of the the basis sets with extra polarizations and no diffuse functions has 
been inverted here, since the 6-31-1 — |-G(2d,2p) is less accurate than the 
smaller 6-31G(2d,2p). 

• Again, and contrarily to the RHF case, the addition of diffuse functions 
to singly-polarized ((d) or (d,p)) basis sets it is sometimes slightly advan- 
tageous (in the 6-31G(d,p) — > 6-31++G(d,p) case) and sometimes slightly 
disadvantageous (in the 6-31G(d) — » 6-31+G(d) case). So that no clear 
conclusion may be drawn to this respect. 
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• Like in the RHF case, and contrarily to the situation for MP2 homolevels, 
where no basis sets lied in the protein region and some MCs presented 
distances of near 3RT with the reference one, here, most MCs lie well 
below G?i2 = RT, and those for which the geometry has been computed 
with a basis set that contains lst-row atoms polarization functions are 
all in the protein region, so that, under the assumptions in sec. 12.21 
they can correctly approximate the reference MC for chains of more than 
100 residues. Remarkably, some of these heterolevel MCs, such as 6- 
311++G(2df,2pd)//6-31G(d) for example, are physically equivalent to the 
reference homolevel up to peptides of 400 residues at less of 10% the com- 
putational cost. Indeed, all these results confirm the heterolevel assump- 
tion, discussed in the introduction and so commonly used in the litera- 
ture [14,29,31,32,46,59,63,64,90], for MP2//MP2-intramethod MCs. 

• Differently from the homolevel case, an accumulation point is reached here 
in the basis sets, since we can see that there is no noticeable increase in 
accuracy beyond 6-31G(d). Regarding the convergence towards the infi- 
nite basis set limit, we observe again that, whereas it is slightly slower 
here than in fig. [51 the signal is too weak to conclude anything and we re- 
peat what we said in the homolevel case: that, for conformational energy 
differences in peptides, the ability of accounting for the geometry in het- 
erolevel MCs of the form L E est //L l G converges approximately at the same 
pace towards the infinite basis set limit for RHF and MP2. 

• Finally, let us mention 6-31G(d) as the one clear example of a particu- 
larly efficient basis set for calculating the geometry in MP2-heterolevel 
MCs. Under the assumptions in sec. 12.21 it can be used without alter- 
ing the relevant conformational behaviour of polypeptides of around 400 
residues (i.e., its distance dn with the homolevel 6-311++G(2df,2pd) is 
~ 0.05i?T), and its computational cost is ~ 2% that of the reference cal- 
culation. The rest of the basis sets in fig. [9] are either less accurate and 
not significantly cheaper, or more expensive and not more accurate than 
6-31G(d). 

Next, after the geometry, we investigate the efficiency for performing en- 
ergy calculations of all the basis sets in table [4] but the largest one. Like in 
the RHF case, the geometry on top of which the single-points are computed 
must be the same, and we have chosen it to be the one calculated at the level 
L bcst := 6-311++G(2df,2pd). Again, since the reference to which the L E //L G est 
heterolevel MCs must be compared is the L best homolevel, and they take more 
computational time than this MC (the time ibest plus the one required to per- 
form the single-point at L l E ), all of them are computationally inefficient a priori. 
Therefore, in the efficiency plot in fig. [TQ1 the time shown in the y-axis is not 
the one needed to calculate the actual PES with the L E //L G cst MC, but just the 
one required for the single-point computation. In principle therefore, the study 
and the conclusions drawn should be regarded only as providing hints about 
how efficient a given basis set will be if it is used to calculate the energy on top 
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Figure 10: Efficiency plot of the MP2-heterolevel MCs L l E //L^f st obtained com- 
puting the geometry at the best level of the theory, L hest :=6-311++G(2df,2pd), 
and then performing a single-point calculation with all the basis sets in table [2] 
but the largest one. In the s-axis, we show the distance di2, in units of RT 
at 300° K, between any given MC and the reference one (the homolevel 6- 
311++G(2df,2pd), indicated by an encircled point), while, in the y-axis, we 
present in logarithmic scale the average computational time taken for the corre- 
sponding single-point, per point of the 12x12 grid defined in the Ramachandran 
space of the model dipeptide HCO-L-Ala-NH 2 . 



of some less demanding geometry than the L best one (in order to have a MC 
that could have some possibility of being efficient). However, in the fourth part 
of the MP2//MP2-intramethod investigation (see below), we show, like we did 
in the RHF case, that the performance of the different basis sets for single-point 
calculations depends weakly on the underlying geometry, so that the range of 
validity of the present part of study must be thought to be wider. Again, the 
time £best used for defining the efficient region in fig. [TO] has been redefined as 
the one needed for a single-point at L best . 

The conclusions of this part of the study are: 

• Like in RHF, 3-21G is very inefficient for energy calculations. 

• Although all basis sets containing lst-row atoms polarization functions are 
more accurate than the ones that do not, differently from the geometry 
case, we do not observe a clear gap in the distance d\2 separating the two 
groups for MP2 single-point energy calculations. 

• Similarly to the MP2-homolevel case and to RHF, the respective positions 
in the plot of 6-31G(2d,2p) and 6-31++G(2d,2p) constitute a signal that 
indicates that the basis sets with extra polarizations and no diffuse func- 
tions are less efficient than their diffuse functions- containing counterparts 
for energy calculations. 
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Figure 11: Efficiency plot of all the MCs in figs. [8] and [9l and also of four 
additional ones of the form L^//6-31G(d). Only the latter arc labeled. In the 
cc-axis, we show the distance dn, in units of RT at 300° K, between any given 
MC and the reference one (the homolevel 6-311++G(2df,2pd), indicated by an 
encircled point), while, in the y-axis, we present in logarithmic scale the average 
computational time per point of the 12x12 grid defined in the Ramachandran 
space of the model dipeptide HCO-L-Ala-NH2. The different accuracy regions 
depending on d±2, are labeled, and the 10% of the time ibest taken by the 
reference homolevel 6-311++G(2df,2pd) is also indicated. 



• As in the rest of the MP2 study, nothing conclusive can be said about 
the addition of diffuse functions to the singly polarized 6-31G(d) and 6- 
31G(d,p) basis sets. 

• Regarding the accuracy of the investigated MCs, the situation here is anal- 
ogous to the one found for RHF, and, again this supports the ideas that 
underlie the heterolevel assumption, showing that, also at MP2, whereas 
the level of the theory may be lowered in the calculation of the (constrained) 
equilibrium geometries, it is necessary to perform high-level energy single- 
points if a good accuracy is sought. 

• Related to the basis set convergence issue, the situation here is analogous 
to the one seen in the case of homolevel MCs: No accumulation point is 
reached, and the accuracy can always be increased by intelligently enlarg- 
ing the basis set. The convergence velocity towards the MP2 limit is again 
very similar to the one in RHF. 

• Finally, let us mention 6-31G(d,p) and 6-31G(d) as some examples of par- 
ticularly efficient basis sets for calculating the energy in MP2-heterolevel 
MCs. They can be used without altering the relevant conformational be- 
haviour of short peptides, and their computational cost is less than 10% 
that of the reference single-point calculation. 
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Efficient MP2//MP2 MCs 


du/RT a 


N b 

1 v res 


t c 


6-311++G(2df,2pd)//6-31G(d) 


0.046 


468.3 


1.90% 


6-31++G(2d,2p)//6-31G(d) 


0.312 


10.2 


0.79% 


6-31++G(d,p)//6-31G(d) 


0.703 


2.0 


0.62% 


6-31G(d)//6-31G(d) 


0.729 


1.9 


0.56% 


6-31G//6-31G 


1.247 


0.6 


0.19% 


3-21G//3-21G 


3.076 


0.1 


0.17% 



Table 5: List of the most efficient MP2//MP2-intramethod MCs located at the lower- 
left envelope of the cloud of points in fig. 1111 The first block contains MCs of the form 
L| cs 7/L l G (see fig.©, the second one those of the form l4//6-31G(d) (see fig. [TTj) . 
and the third one the homolevels in fig. [8] "Distance with the reference MC (the 
homolevel 6-311++G(2df,2pd)), in units of RT at 300° K. ''Maximum number of 
residues in a polypeptide potential up to which, under the assumptions in sec. 12.21 
the corresponding MC may correctly approximate the reference. c Required computer 
time, expressed as a fraction of tbeat- 

To close the MP2//MP2-intramethod section, we have calculated four PESs 
with MCs of the form L^//6-31G(d), since the geometry computed with 6- 
31G(d) has proved to be very accurate when a single-point at the highest level 
was performed on top of it. Due to the same computational arguments presented 
in the previous section, only those basis sets significantly larger than 6-31G(d) 
have been explored for calculating the energy. The results are presented in 
fig. fTTI together with a summary of the rest of the MP2//MP2 MCs studied in 
this section (except for the inefficient L E //L^? st ones). 

We have already advanced a conclusion that may be extracted from this last 
plot, namely, that if we compare the distance d±2 of the L E //6-3lG(d) MCs 
in fig. [IT] to the distance of the L E //L^? st ones in fig. [10] for the same L l E , we 
see that they are very close. Therefore, like in the RHF case, we conclude that 
the accuracy of a given MC depends much more strongly on the level used for 
calculating the energy than on the one used for the geometry. 

Finally, in table [3 we present the most efficient MCs that lie at the lower- left 
envelope of the plot in fig. [TT] Like in RHF, we can see that, depending on the 
target accuracy sought, these most efficient MCs may belong to different groups 
among the ones investigated above. From ~ 0i?T to ~ Q.IRT, for example, 
the most efficient MCs are the L E ost //L l G ones; from - 0.1i?T to - 0.75i?T, 
on the other hand, the MCs of the form L^//6-31G(d) outperform those in the 
rest of groups; finally, for distances d\2 > 0.75RT, it is recommendable to use 
homolevel MCs. 



32 



3.3 Interlude 



The general abstract framework behind the investigation presented in this study 
(and also behind most of the works found in the literature), may be described 
as follows: 

The objects of study are the model chemistries defined by Pople [60] and 
discussed in the introduction. The space containing all possible MCs is a rather 
complex and multidimensional one and it is denoted by A4 in fig. [T2J The MCs 
under scrutiny are applied to a particular problem of interest, which may be 
thought to be formed by three ingredients: the physical system, the relevant 
observables and the target accuracy. The MCs are then selected according to 
their ability to yield numerical values of the relevant observables for the physical 
system studied within the target accuracy. The concrete numerical values that 
one wants to approach are those given by the exact model chemistry MC e , which 
could be thought to be either the experimental data or the exact solution of 
the electronic Schrodinger equation. However, the computational effort needed 
to perform the calculations required by MC e is literally infinite, so that, in 
practice, one is forced to work with a reference model chemistry MC rcf , which, 
albeit different from MC e , is thought to be close to it. Finally, the set of MCs 
that one wants to investigate are compared to MC rof and the nearness to it is 
seen as approximating the nearness to MC e . 

These comparisons are commonly performed using a numerical quantity T> 




Figure 12: Space M. of all model chemistries. The exact model chemistry MC £ 
is shown as a black circle, MP 2 MCs are shown as grey-filled circles and RHF 
MCs as white-filled ones. The homolevel reference PESs are indicated with an 
additional circle around the points. The situation depicted is (schematically) 
the one found in this study. 
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that is a function of the relevant observables. In order for the intuitive ideas 
about relative proximity in the M. space to be captured and the above reasoning 
to be meaningful, this numerical quantity T> must have some of the properties 
of a mathematical distance. In particular, it is advisable that the triangle in- 
equality is obeyed, so that, for any model chemistry MC, one has that 



and, assuming that £>(MC e ,MC rcf ) is small (and I? is a positive function), 
we obtain 



which is the sought result in agreement with the ideas stated at the beginning 
of this section. 

The distance d\i introduced in ref. 92 and summarized in sec. l2.2[ measured 
in this case on the conformational energy surfaces (the relevant observable) 
of the model dipeptide HCO-L-Ala-NH2 (the physical system), approximately 
fulfills the triangle inequality and thus captures the nearness concept in the 
space M. of model chemistries. 

Now, as we have advanced and after having completed the intramcthod 
parts of the study with both the RHF and MP2 methods, we shall use the ideas 
discussed above to tackle the natural question about the transferability of the 
RHF results to the more demanding and more accurate MP2-based MCs. 

As a first step to answer this question, we point out that the distance between 
the reference RHF/6-311++G(2df,2pd) MC and the MP2 one depicted in fig. 
is ~ 1A2RT. This prevents us from using the former as an approximation 
of the latter even for dipeptides if we want that the conformational behaviour 
at room temperature be unaltered. It also indicates that, whereas basis set 
convergence has been reasonably achieved within the family of Pople's Gaussian 
basis sets, both for homo- and hetcrolevcl MCs and inside the two methods, the 
convergence in method has not been achieved in the RHF — > MP2 step, even 
with the largest basis set investigated 6-311++G(2df,2pd). 

Complementarity to this, in fig. [T3J we show the distance of all RHF//RHF 
MCs studied in sec. GO] (except for the inefficient L l E //L^? st ones), with both the 
RHF reference (in the y-axis) and the MP2 one (in the x-axis). Some relevant 
remarks may be made about the situation encountered: 

• The distance of all RHF-intramethod MCs to the MP2 reference is larger 
than RT, therefore, none of the former may be used to approximate the 
latter, not even in dipeptides. 

• Although a general trend could be perceived and, for example, the RHF 
homolevels can be clearly divided in both axes by the lst-row atoms po- 
larization gap found in the previous sections, the correlation between the 



2?(MC e , MC) < X>(MC e , MC rcf ) + X>(MC rof , MC) , 
D(MC e ,MC)> |P(MC e ,MC rcf ) - X>(MC rof ,MC)| , 



(4a) 
(4b) 



P(MC £ , MC) ~ P(MC rof , MC) , 



(5) 
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Figure 13: All RHF-intramethod MCs studied in sec. 13.11 except for the inef- 
ficient L l E //L i Q St ones. The distance d\2, in units of RT at 300° K, with the 
homolcvcl MP2/6-311++G(2df,2pd) reference is shown in the x-axis, while the 
distance with the RHF reference is shown in the y-axis. The different accuracy 
regions depending on d\2, arc labeled, and two groups of homolevel MCs are 
distinguished: those that contain lst-row atoms polarization shells and those 
that do not. 



distance to the MP2 reference and the distance to the RHF one is as low 
as r ~ 0.66, being r Pearson's correlation coefficient. Therefore, almost all 
details are lost and the accuracy with respect to RHF/6-311++G(2df,2pd) 
cannot be translated into accuracy with respect to the MP2 reference. 

• Related to the previous point, some strange behaviours are present. For 
example, not only are there RHF//RHF MCs that are closer to the MP2 
reference than RHF/6-311++G(2df,2pd), but the one that is closest is the 
small RHF/4-31G(d,p) homolevel. This is probably caused by fortituous 
cancellations that shall not allow systematization and that may unpre- 
dictably vary from one problem to another. Similar compensations have 
already been observed in the literature [17,25,31]. 

• If wc denote by MC^m the MP 2 reference MC and, by MCg£ p , the RHF 
one, we may use eqs. (@|, 



di 2 (MCS f p 2 ,MC) < d 12 (MC^ f p 2 ,MC^HF) +di 2 (MC^HF,MC) , (6a) 
dia(MCj£, a ,MC) > |d 12 (MC r ^ P2 ,MC^ F ) - d 12 (MC^ F ,MC)| , (6b) 

to notice that, since di 2 (MC^jp 2 , MCrh F ) ~ 1A2RT, for any model chem- 
istry MC that is close to the RHF-intramethod reference, i.e., that present 
a small di 2 (MCp HF , MC), we have that 
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Homolevels 
L' E //6-31G(d) 



d,-, I RT - MP2 model chemistries 



Figure 14: Distance to their respective references of the MP2- and RHF- 
intramethod MCs corresponding to the same combination of basis sets B l E // B l G , 
expressed in units of RT at 300° K. Only the region with d\ 2 < 1.5RT is shown, 
and the best-fit line is depicted with a dotted line. 



d 12 (MCS f P2 ,MC) ~ d 12 (MC^ P2 ,MC^ F ) ~ 1A2RT . (7) 

This set of RHF-intramethod MCs that are close to the RHF reference, 
and that present the approximately constant value of di2(MC}^p2: MC) 
above, can be associated to the lower group encircled in fig. [T51 

All the points above illustrate what we have already advanced at the begin- 
ning of sec. 13.21 that the accuracy (or the efficiency, if computational time is 
included in the discussion) of any MC with respect to a good RHF reference, 
such as the RHF/6-311++G(2df,2pd) one, cannot be transferred to higher lev- 
els of the theory and, therefore, any such comparison must be seen as providing 
information only about the infinite basis set Hartree-Fock limit. 

To close this section, let us approach the question of the RHF — ► MP2 
transferability of the results from a different angle. 

We have proved in the preceding paragraphs that the study of RHF-intra- 
method MCs comparing them to a good RHF reference cannot be used for 
predicting the accuracy of those MCs with respect to a probably better MP2 
reference. Now, in sec. 13.21 MP2-intramethod MCs have been compared to 
the MP2/6-311++G(2df,2pd) homolcvcl, which, in turn, has been shown to 
be close to the infinite basis set MP2 limit. However, this level of the theory 
is very demanding computationally: the whole 12x12 grid of points in the 
PES of HCO-L-Ala-NH2 has taken ~ 3 years of computer time in 3.20 GHz 
PIV machines, while the one calculated at RHF/6-311++G(2df,2pd) has taken 
'only' ~ 6 months (see sec. 12. 1|) . 
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Therefore, we have decided to check whether or not the accuracy of a given 
RHF-intramethod MC with respect to the RHF reference is indicative of the 
accuracy of the MP2-intramethod MC that uses the same basis sets with re- 
spect to its own MP2 reference. The answer to this question is in fig. [T3J 
There, each point corresponds to a given combination of basis sets B l E //B l G and, 
in the x-axis, the distance between the associated MP2 MC and the MP2/6- 
311++G(2df,2pd) reference is shown. In the y-axis, on the other hand, we 
present the distance of the analogous RHF MC to the RHF/6-311++G(2df,2pd) 
homolevel. 

Although, since we have had to restrict ourselves to that combinations that 
were present both in sec. 13.11 and in sec. 13.21 the set of MCs is smaller in this 
case, the conclusion extracted is that the correlation is more significant than 
before: r ~ 0.92 if we use all the MCs, and r ~ 0.80 if we remove the 3-21G 
homolevel, which is very inaccurate in both cases, from the set. This indicates 
that, although some details might be lost, the relative efficiency of Gaussian 
basis sets in RHF-intramethod studies provides hints about their performance at 
MP2, and it partially justifies the structure of the investigation presented here. 

Finally, the overall situation described in this section and the relations among 
all the intramethod MCs studied are schematically depicted in fig. [T2] 

3.4 MP2//RHF-intermethod model chemistries 

In the final part of the study presented here, we investigate the efficiency of 
heterolevel MCs in which the geometry is calculated at RHF and, then, a single- 
point energy calculation is performed on top of it at MP2. They shall be termed 
MP2/I ' RHF-intermethod MCs. 

To this end, the RHF geometries that are used are those computed with the 
8 basis sets in table HI Like in sec. 13.21 they have been selected from those in 
table [2] looking for the most efficient ones, but also trying to reasonably sample 
the whole group of basis sets, in order to check whether or not the behaviours 
and signals observed in the remaining parts of the study are repeated here. 
The MP2 single-points, on the other hand, are computed with the whole set of 
possibilities in tabled] 

In fig. [13 we present an efficiency plot, using the MP2/6-311++G(2df,2pd) 
homolevel as reference MC, and containing all the MP2//MP2 MCs studied in 
sec. l3.2l together with the new 88 possible MP2//RHF-intermethod combinations 
of the form MP2/B i E //'RHF/B i G . 

Some conclusions can be drawn from this plot: 



3-21G 6-31G(d) 6-31+G(d) 6-311+G(d) 

6-31G 6-31G(2d,2p) 6-31++G(2d,2p) 6-311++G(2df,2pd) 



Table 6: Basis sets investigated for calculating the geometry in the MP2// RHF- 
intermethod part of the study. 
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Figure 15: Efficiency plot of all the MP2//MP2 MCs in fig. [IT] together 
with the new 88 possible MP2//RHF-intermethod combinations of the form 
MP2 / B* E //RHF / Bq introduced in this section. In the z-axis, we show the dis- 
tance G?i2, in units of RT at 300° K, between any given MC and the reference 
one (the homolevel MP2/6-311++G(2df,2pd), indicated by an encircled point), 
while, in the y-axis, we present in logarithmic scale the average computational 
time per point of the 12x12 grid defined in the Ramachandran space of the 
model dipeptide HCO-L-Ala-NH 2 . The different accuracy regions depending on 
di2, are labeled, and the 10% of the time ibest taken by the reference homolevel 
MP2/6-311++G(2df,2pd) is also indicated. 



• Due to the larger computational demands of the MP2 method, even the 
MCs whose geometry has been computed at the highest RHF level, the one 
with the 6-311++G(2df,2pd) basis set, are much cheaper than the MP2 
reference. Their times are slightly larger than the 10% of ibest, whereas 
all the rest of MP2//RHF MCs take less than that bound. 

• For all the RHF geometries, the MCs whose MP2 single-point has been 
calculated with 3-21G, 6-31G, 6-31++G and most of the 6-311+G(d) ones 
lie above d\ 2 = RT, so that they should not be used even on dipeptides. 
This is related to the lst-row atoms polarization gap observed in previous 
sections, although the signal is not so strong here. 

• The rest of MP2//RHF MCs not included in the two previous points lie 
at the efficient region, defined as that for which d\ 2 < RT and t < 10% 
of tbest- This confirms the heterolevel assumption also in the intermcthod 
context. 

• However, no MP2//RHF-intcrmcthod MCs, not even the ones with the 
single-point calculated at the highest MP2/6-311++G(2df,2pd) level, lie 
in the protein region. Therefore, if we want to approximate the reference 
MP2 results for peptides longer than 100 residues, under the assumptions 
in sec. 12 .21 the geometry calculation must be performed at MP2. In fact, 
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this is the correct way of addressing the discussion between Csasar [91] 
and Schafer [23,114], in which the former defends the position that the 
geometry can be calculated at RHF (provided that a subsequent MP2 
single-point is performed on top of it), while the latter disagrees and rec- 
ommends to compute the geometry at MP2 too. The data they argue 
about is. of course, the same; the discrepancy simply arises from the fact 
that Csasar is thinking only in the small systems in which the calculations 
have been performed, while Schafer wants to use the information obtained 
to gain understanding about the folding of long peptides. In this work, the 
distance introduced in ref. 92 and summarized in sec. l2.2l codifies whether 
or not two different MCs yield equivalent PESs for the HCO-L-Ala-NH 2 
dipeptide (i.e., if d < RT, they are physically indistinguishable at tem- 
perature T). On the other hand, if we assume a particular form in which 
the polypeptide potential is constructed from the single-residue PESs and 
use the additivity properties of d [92], we can show that this error grows 
with the square root \fN of the number of residues and use the quantity 
N Tea , defined in sec. 12.21 to estimate how the difference between the two 
MCs affects the behaviour of polypeptides. In a work in progress in our 
group [95] , we are investigating how the distance scales with N for differ- 
ent and more realistic hypotheses. It is in this sense, that the statements 
relying on N rcs should be regarded as an estimation. 

• Finally, let us point out that there is no accuracy region where the MP2- 
homolevel MCs are more efficient than the rest. 

Now, in fig. 116) the efficient region of the previous plot is enlarged and, due 
to the large number of MP2//RHF MCs studied, two subplots are produced for 
visual comfort: the one in fig. [T5k . in which the MCs sharing the same RHF 
level for the geometry have been joined by dotted lines, and the one in fig. [T6b . 
in which the MCs sharing the same MP2 level for the single-point calculations 
have been joined by broken lines. 

Let us remark some interesting facts that can be seen in these two more 
detailed plots: 

• The leftmost group of five MP2//RHF MCs that show the highest accu- 
racy are those in which the geometry has been obtained with basis sets 
containing lst-row atoms polarization functions and the single-point en- 
ergy calculation has been performed at MP2/6-311++G(2df,2pd). In par- 
ticular, the MP2/6-311++G(2df,2pd)//RHF/6-31G(d) PES can correctly 
approximate the reference one up to peptides of ~ 25 residues, under the 
assumptions mentioned above, at around 1% its computational cost. This 
supports the heterolevel assumption for MP2//RHF-intcrmcthod MCs. 

• The RHF geometries calculated with the unpolarized basis sets 3-2 1G and 
6-31G arc, in general, less accurate than the rest, however, due to their 
low computational cost, they turn out to be the most efficient ones from 
d\2 — QART on. Remarkably, 3-21G is more efficient than 6-31G. 
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Figure 16: Selected region of the efficiency plot in fig. [15] In (a), the MCs 
sharing the same RHF level for the geometry have been joined by dotted lines 
and the basis set used for that part of the calculation is indicated. In (b) , the 
MCs sharing the same MP2 level for the single-point calculations have been 
joined by broken lines and the corresponding basis set labels are also shown. 
The order in which the points have been joined in both cases has no meaning 
at all and it is only intended for visual convenience. 
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Efficient MP2//MP2 and MP2//RHF MCs 


dia/KT a 


N ies b 


t c 


MP2/6-311++G(2df,2pd)//MP2/6-31G(d) 


0.046 


468.3 


1.90% 


MP2/6-311++G(2df,2pd)//RHF/6-31G(d) 


0.194 


26.7 


1.48% 


MP2/6-31++G(2d,2p)//RHF/6-31G(d) 


0.367 


7.4 


0.37% 


MP2/6-31++G(2d,2p)//RHF/3-21G 


0.417 


5.7 


0.27% 


MP2/6-31+G(d)//RHF/3-21G 


0.645 


2.4 


0.06% 


MP2/6-31G(d)//RHF/3-21G 


0.831 


1.4 


0.05% 


MP2/6-31++G//RHF/3-21G 


1.033 


0.9 


0.05% 


MP2/6-31G//RHF/3-21G 


1.263 


0.6 


0.05% 


MP2/3-21G//RHF/3-21G 


3.043 


0.1 


0.05% 



Table 7: List of the most efficient MP2//MP2 and MP2//RHF MCs located at the 
lower-left envelope of the cloud of points in fig. 1151 The first block contains the only 
MP2//MP2 MC in the list, the second one the MP2//RHF MCs with a distance d 12 
below RT, and the third one those that are inaccurate even for dipeptides. "Distance 
with the reference MC (the homolevel MP2/6-311++G(2df,2pd)), in units of RT at 
300° K. 'Maximum number of residues in a polypeptide potential up to which the 
corresponding MC may correctly approximate the reference, under the assumptions in 
sec. 12.21 c Required computer time, expressed as a fraction of tbeat- 

• In fig. 116b. we can observe that, for the medium-sized basis sets 6-31++G 
(d,p), 6-31+G(d), 6-31G(d,p) and 6-31G(d), the single-point accuracy is 
rather insensitive to their differences and they may be used interchange- 
ably. There is, however, a weak signal, in the region of unpolarized RHF 
geometries, indicating that the addition of diffuse functions may increase 
the quality of the energy calculations at MP2. 

• The relative accuracy of the MCs whose MP2 single-point has been com- 
puted at 6-31++G(2d,2p) and at 6-31G(2d,2p) suggests that, like in pre- 
vious parts of the study, it is a good idea to add diffuse functions to basis 
sets that contain doubly-split polarizations shells, also for the MP2 energy 
calculations in MP2//RHF-intermethod MCs. 

• Like it happened in sec 13. li in fig. [T6k . we notice that there is no real 
improvement if we calculate the RHF geometry beyond 6-31G(d). So 
that, an accumulation point is reached for RHF geometries in MP2/ / RHF- 
intermethod MCs. 

Finally, in table we present the most efficient MCs that lie at the lower- left 
envelope of the plot in fig. 1151 These arc the most efficient MCs found in this 
work. 
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4 Conclusions 



In this study, we have investigated more than 250 PESs of the model dipeptide 
HCO-L-Ala-NH 2 calculated with homo- and heterolevel RHF//RHF, MP2//MP2 
and MP2//RHF MCs. All of the PESs are available as supplementary material. 
As far as we are aware, the highest-level PESs in the literature, the MP2/6- 
311++G(2df,2pd) homolevel in fig. \7\ has been used as a reference and all the 
rest of calculations have been compared to it (except for sec. 13.11 where the 
RHF//RHF MCs have been compared to RHF/6-311++G(2df,2pd)). The data 
and the results extracted are so extense that we have considered convenient to 
give here a brief summary of the most important ones. 

The first conclusion that we want to point out is that, for the largest basis set 
evaluated here, the 6-311++G(2df,2pd) one, for which the RHF and MP2 limits 
appear to have been reached, the convergence in method has not been achieved. 
I.e., the distance between the MP2 and RHF references is di2 — 1A2RT, so 
that the latter cannot be used to approximate the former even for dipeptides. 
Therefore, we discourage the use of RHF// RHF MCs for peptides, and, unless 
otherwise stated, most of the conclusions below should be understood as re- 
ferring either to MP2//MP2-intramethod or to MP2//RHF-intermethod MCs, 
which have proved to be acceptably accurate with respect to the best MP2 
calculation. 

The second observation related to the comparison between RHF and MP2 is 
that the RHF — > MP2 transferability of the relative accuracies between MCs is 
imperfect, and the conclusions regarding the relative efficiency of the different 
basis sets arrived using the former cannot be directly extrapolated to the latter. 
This point has two distinct aspects: On the one hand, we have shown that 
to compare RHF//RHF MCs to a good RHF reference gives little information 
about their accuracy with respect to a good MP2 MC. On the other hand, 
the comparison with the RHF reference of PESs calculated with MCs of the 
form KHF / B l E //RHF / Bq may provide useful information about the relative 
accuracy of the analogous MP2/ B E //MP2/ B l G MCs with respect to their own 
MP 2 reference. 

Now, keeping these considerations in mind, let us summarize the most im- 
portant conclusions pertaining the relative efficiency of the Pople split-valence 
basis sets investigated: 

• In the whole study, the polarization shells in lst-row atoms have been 
shown to be essential to accurately account for both the conformational 
dependence of the geometry and of the energy of the system. Except for 
some particular MCs with 3-21G geometries, which may be used if we plan 
to describe short oligopeptides, our recommendation is that polarization 
functions in lst-row atoms be included. 

• In most cases, we have also observed a strong signal indicating that no 
basis sets should be used containing doubly-split polarization shells and no 
diffuse functions. 
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• The 6-31G(d) basis set, which is frequently used in the literature, [8,9,12, 
15,16,29,97,101,111,115,116], has turned out to be a very efficient one 
for calculating the geometry both at RHF and MP2. 

• Regarding the basis set convergence issue, we can conclude that, for the 
largest basis sets in the Pople split-valence family, both the RHF and MP2 
infinite basis set limits are approximately reached. 

• Finally, some weaker signals have been observed suggesting that to add 
higher angular momentum polarization shells (f,d) before adding the lower 
ones may be inefficient, that it is not recommendable to put polarization 
or diffuse functions on hydrogens only, and that it may be efficient in some 
cases to add diffuse functions to singly-polarized basis sets. 

Referring to the heterolevel assumption, which, as far as we are aware, has 
been tested in this work for the first time in full PESs: 

• As a general and very clear conclusion, since only some small-basis set 
homolevels lie in the lower-left envelope of the efficiency plots presented 
in the previous sections, and, in all cases, it happens for distances d\2 
greater than RT, we can say that the heterolevel assumption is correct for 
the description of the conformational behaviour of the system studied here 
with MP2//MP2 and MP2//RHF MCs (also for RHF//RHF-heterolevels 
but, as we remarked above, this has little computational interest). 

• Due to the much stronger dependence of the accuracy of MCs on the 
level used for the single-point than on the one used for the geometry 
optimization, together with the lower computational cost of the former, 
the general recommendation is that the greatest computational effort be 
dedicated to the energy calculation. 



• Despite this general thumb rule, and under the assumptions in sec. 12 .21 
if one wants to approximate the MP2 reference calculation for peptides 
of more than 100 residues, the geometry must be calculated using MP2. 
Nevertheless, with small and cheap basis sets, such as 6-31G(d), the 
MP2//MP2 results can be good enough at a low computational cost. 

Finally, let us remark that the investigation performed here has been done 
in one of the simplest dipeptides. The fact that we have treated it as an isolated 
system, the small size of its side chain and also its aliphatic character, all play a 
role in the results obtained. Hence, for bulkier residues included in polypeptides, 
and, specially for those that are charged or may participate in hydrogen-bonds, 
the conclusions drawn about the relative importance of the different type of 
functions in the basis set, as well as those regarding the comparison between 
RHF and MP2, should be approached with caution and much interesting work 
remains to be done. 

All the PESs investigated are publicly available as supplementary material 



at http : //www.pabloechenique . com/f iles/public/suppjnaterials/ Each 
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one of them is a three-column text file containing, in this order, the values of 
the Ramachandrand angles (f> and ip in the 12x12 grid defined in sec. 12.11 and 
the energy in hartrees. They are organized in subfolders indicating whether 
they correspond to RHF- or MP2-homolevels, or to RHF//RHF-, MP2//MP2- 
or MP2//RHF-hctcrolcvcls. The filenames arc explicative (note that a letter V 
has been used to indicate that a particular Gaussian shell is missing in either 
the lst-row atoms or the hydrogens). 
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